Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect metrics about the active/idle connections to ES nodes #141434

Merged

Conversation

gsoldevila
Copy link
Contributor

@gsoldevila gsoldevila commented Sep 22, 2022

As part of #134362, we want to have more visibility on the amount of open connections to ES nodes, for our different deployments.

With a cloud-first mindset, we are adding a new collector that will retrieve socket information from AgentManager, exposing this information on the /api/stats endpoint, so that it can be consumed by the Kibana Metric Beat, and sent over to our monitoring cluster.

Here's an example of what the new properties look like (extracted from a local environment):

image

@gsoldevila gsoldevila added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Stack Monitoring release_note:feature Makes this part of the condensed release notes telemetry Issues related to the addition of telemetry to a feature backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) v8.6.0 labels Sep 22, 2022
@gsoldevila gsoldevila requested a review from a team as a code owner September 22, 2022 13:45
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@gsoldevila gsoldevila added backport:skip This commit does not require backporting and removed backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Sep 22, 2022
Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments and a whole bunch of questions but I'll review again once CI goes green.

There will be some follow-up work to add these metrics to Cloud


expect(httpAgents.size).toEqual(2);
expect(httpsAgents.size).toEqual(2);
expect(httpAgents.has(agent1)).toEqual(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative:

expect([...httpAgents]).toEqual(expect.arrayContaining([agent1, agent2]));
expect([...httpAgents]).not.toEqual(expect.arrayContaining([agent3, agent4]));
expect([...httpsAgents]).toEqual(expect.arrayContaining([agent3, agent4]));
expect([...httpsAgents]).not.toEqual(expect.arrayContaining([agent1, agent2]));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think that would work

expect.arrayContaining(array) matches a received array which contains **all** of the elements in the expected array

expect([...httpsAgents]).not.toEqual(expect.arrayContaining([agent1, agent2])); would be passing for [agent1, agent3] AFAIK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright! I'll simplify what can be simplified then.

const httpAgents = new Set<HttpAgent>([new HttpAgent(), new HttpAgent()]);
const httpsAgents = new Set<HttpsAgent>([new HttpsAgent(), new HttpsAgent()]);

AgentManagerMock.mockImplementationOnce(() => ({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the longer term, we might need a proper mock for AgentManager but for now, it's probably ok to inline here.

}

public reset() {
// TODO check if we have to implement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd probably want to reset the collector on rolling restarts, to ensure we're not carrying over any stale metrics. Even if not needed, it's probably a good idea to reset the collector anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client metrics are kinda like os/cgroups metrics though, in the way that the actual collector doesn't hold any data and just delegates to a lower-level actor, so I'm not sure to see what we could / should be cleaning here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That answer the question, thanks!

expect(getAgentsSocketsStats).toHaveBeenCalledTimes(2);
expect(getAgentsSocketsStats).toHaveBeenNthCalledWith(1, httpAgents);
expect(getAgentsSocketsStats).toHaveBeenNthCalledWith(2, httpsAgents);
expect(metrics).toMatchInlineSnapshot(`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using snapshots in the metrics service has, in my experience, been a nightmare with flaky tests. We're already seeing failures for https://github.com/elastic/kibana/pull/141434/files#diff-3876c2ad50be4ead02241bceff9d5a9ee0d493a5fb53ff27c628cecc5d2716a6R72.

const agents = new Set<Agent>([new Agent(), new Agent()]);

const stats = getAgentsSocketsStats(agents);
expect(stats).toMatchInlineSnapshot(`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using snapshots has, in my experience, been flaky with the metrics service. We should be proactive and create test fixtures for these, similarly to how we test the process and event loop delays.

Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall implementation looking good.

Just a bunch of remarks and NITs

Comment on lines 139 to 148
expect(httpAgents.size).toEqual(2);
expect(httpsAgents.size).toEqual(2);
expect(httpAgents.has(agent1)).toEqual(true);
expect(httpAgents.has(agent2)).toEqual(true);
expect(httpAgents.has(agent3)).toEqual(false);
expect(httpAgents.has(agent4)).toEqual(false);
expect(httpsAgents.has(agent1)).toEqual(false);
expect(httpsAgents.has(agent2)).toEqual(false);
expect(httpsAgents.has(agent3)).toEqual(true);
expect(httpsAgents.has(agent4)).toEqual(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I think the order in these arrays are determined by the order of creation, right? So at this point, why just not check for exact array content?

expect(httpAgents).toEqual([agent1, agent2]);
expect(httpsAgents).toEqual([agent3, agent4]);

Comment on lines 82 to 88
public getHttpAgents(): Set<NetworkAgent> {
return this.httpStore;
}

public getHttpsAgents(): Set<NetworkAgent> {
return this.httpsStore;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can return Set<HttpAgent> for getHttpAgents and Set<HttpsAgent> for getHttpsAgents, can't we? Would be a little more precise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N/A as I've merged both Agent types into a single set (commit coming soon).

@@ -12,6 +12,7 @@ import type {
ElasticsearchServiceStart,
ElasticsearchServiceSetup,
} from '@kbn/core-elasticsearch-server';
import { AgentManager } from '@kbn/core-elasticsearch-client-server-internal';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: import type

@@ -94,6 +95,7 @@ const createInternalSetupContractMock = () => {
level: ServiceStatusLevels.available,
summary: 'Elasticsearch is available',
}),
agentManager: new AgentManager(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @TinaHeiligers's remark: we'd ideally want a proper mocked version here instead of delegating to the actual implementation.

Note that given it is only exposed on the internal contract / mock, the impact is not that significant.

Comment on lines 15 to 21
jest.mock('@kbn/core-elasticsearch-client-server-internal');
jest.mock('./get_agents_sockets_stats');

const AgentManagerMock = AgentManager as jest.MockedClass<typeof AgentManager>;
const getAgentsSocketsStatsMock = getAgentsSocketsStats as jest.MockedFunction<
typeof getAgentsSocketsStats
>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: For consistency, would be better to use the pattern of declaring a test's mocks in a [file].test.mocks.ts file instead, and importing the mocked things from it, e.g packages/core/metrics/core-metrics-collectors-server-internal/src/os.test.mocks.ts

}

public reset() {
// TODO check if we have to implement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client metrics are kinda like os/cgroups metrics though, in the way that the actual collector doesn't hold any data and just delegates to a lower-level actor, so I'm not sure to see what we could / should be cleaning here?

Comment on lines 29 to 30
totalActiveSockets += sockets?.length ?? 0;
nodesWithActiveSockets[node] = (nodesWithActiveSockets[node] ?? 0) + (sockets?.length ?? 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I would set sockets?.length ?? 0; into a variable to avoid repeating it (same with freeSockets?.length ?? 0 below).

Comment on lines 58 to 65
this.metricsCollector = new OpsMetricsCollector(
http.server,
elasticsearchService.agentManager,
{
logger: this.logger,
...config.cGroupOverrides,
}
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was gonna ask why you didnt add agentManager to the second param, but looking at it, I see we're using the OpsMetricsCollectorOptions both for OpsMetricsCollector and OsMetricsCollector construction.

Outside of the scope of this PR, but we should fix that, and have OpsMetricsCollector use its own single structure / type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #141922 for it

Comment on lines 46 to 50
describe('#start', () => {
it('invokes setInterval with the configured interval', async () => {
await metricsService.setup({ http: httpMock });
await metricsService.setup({ http: httpMock, elasticsearchService: esServiceMock });
await metricsService.start();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're actually missing a test to assert what parameters the OpsMetricsCollector instance is created with during #setup.

Optional given it wasn't here in the first place, but we could add one.

Comment on lines 48 to 49
/** Number of HTTP Agents that have are currently being used by the ES-js client */
agents: number;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: can be http or https. would just remove this word from the description

export interface ElasticsearchClientsMetrics {
/** Number of HTTP Agents that have are currently being used by the ES-js client */
agents: number;
/** Number of ES instances (or proxies) that ES-js client is connecting to */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

Suggested change
/** Number of ES instances (or proxies) that ES-js client is connecting to */
/** Number of ES nodes that ES-js client is connecting to */

Kibana doesn't support connecting through a proxy.

Comment on lines 78 to 81
export interface ElasticsearchClientsMetricsByProtocol {
http: ElasticsearchClientsMetrics;
https: ElasticsearchClientsMetrics;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the value in having separate metrics for http and https?
Is it even possible to configure Kibana to use https and http nodes? Is it even possible to run an Elasticsearch cluster with mixed http / https nodes?

As a first step we just want to use this data to understand and optimize the default performance on cloud. But if this is useful to us it's useful to customers too. But if there's two keys then all the dashboards built on this data will have to have two queries and it will always show e.g. "http.averageActiveSocketsPerNode" and "https.averageActiveSocketsPerNode" even if only one will have data. So this kinda makes the dashboard noisy.

Copy link
Contributor Author

@gsoldevila gsoldevila Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's technically possible, at least from a configuration standpoint. When users specify the hosts: property they could be introducing mixed http and https hosts.

I was thinking that we would mainly rely on "https" for dashboards, which will be the normal for our cloud deployments.

Then, if we find a faulty deployment with degraded performance, we could take a look and see if there are open sockets in the 'http' protocol? I'm not really sure how helpful it can be.

If you think it brings more noise than value I can simply remove the 'http' one, or try to merge them together into a single "store". Alternatively, from the Kibana metrics beat, we could simply consume the 'https' one. This way, the /api/stats will continue to expose both (which could be useful for investigations on self-managed), and the monitoring cluster will only ingest the 'https' one, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine it might be possible to configure mixed http/https nodes for Elasticsearch for high availability you might have mixed nodes while you switch nodes in your cluster one at a time to serving over https. But I suspect this is extremely rare as a long term configuration.

Mixed protocols means the elasticsearch.maxSockets configuration option behaves differently and Kibana would have double as many sockets as a pure http or pure https configuration. In such a case only seeing the https sockets in a dashboard would be misleading and I think it would be more useful to see all sockets to all protocols on a dashboard.

So it feels like merging the agents would always provide a more holistic picture of how many sockets your Kibana is opening. I'm not sure if knowing if it's https or http would make us make any different decisions, in some sense it feels like an implementation detail, but we could have a field like protocol: 'http' | 'https' | 'mixed'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, I'll update my PR accordingly, thanks!

else if (http) protocol = 'http';
else protocol = 'none';

return {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where it happens but I tested this by creating a lot of idle connections and then getting /api/stats which returns:

...
    "elasticsearch_client": {
        "protocol": "https",
        "connected_nodes": 1,
        "nodes_with_active_sockets": 0,
        "nodes_with_idle_sockets": 1,
        "total_active_sockets": 0,
        "total_idle_sockets": 4494,
        "total_queued_requests": 0,
        "most_active_node_sockets": null,
        "average_active_sockets_per_node": null,
        "most_idle_node_sockets": 4494,
        "average_idle_sockets_per_node": 4494
    },
...

We would probably use a long mapping for these fields so then I think a value of null (like for average_active_sockets_per_node) would cause an indexing error on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'm adding some UT to cover these edge cases.

Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few remarks, but looking good to me

@@ -26,7 +26,7 @@ import { ScopedClusterClient } from './scoped_cluster_client';
import { getDefaultHeaders } from './headers';
import { createInternalErrorHandler, InternalUnauthorizedErrorHandler } from './retry_unauthorized';
import { createTransport } from './create_transport';
import { AgentManager } from './agent_manager';
import { AgentFactoryProvider } from './agent_manager';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: import type

@@ -12,7 +12,7 @@ import type { ElasticsearchClientConfig } from '@kbn/core-elasticsearch-server';
import { parseClientOptions } from './client_config';
import { instrumentEsQueryAndDeprecationLogger } from './log_query_and_deprecation';
import { createTransport } from './create_transport';
import { AgentManager } from './agent_manager';
import { AgentFactoryProvider } from './agent_manager';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: import type

* Side Public License, v 1.
*/

import { AgentStore, NetworkAgent } from '@kbn/core-elasticsearch-client-server-internal';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: import type

constructor(private readonly agentStore: AgentStore) {}

public async collect(): Promise<ElasticsearchClientsMetrics> {
return getAgentsSocketsStats(this.agentStore.getAgents());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: return await getAgentsSocketsStats for better stacktraces

Comment on lines 15 to 17
jest.doMock('http');
const agent = new HttpAgent();
return Object.assign(agent, defaults);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we doing here exactly? what's the intent of the in-function doMock call?

Copy link
Contributor Author

@gsoldevila gsoldevila Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal was to mock HttpAgent and allow overriding a few of its properties (mainly the lists of sockets and freeSockets). There's probably a better way to do it 😬

UPDATE: I need a "real" instance so that the code in getAgentsSocketsStats that checks instanceof HttpsAgent does not fail for the test.

UPDATE 2: I removed the doMock(..) statements. They're unnecessary cause I don't need to mock any methods of the class afterwards.

Comment on lines 60 to 65
let protocol: ElasticsearchClientProtocol;

if (http && https) protocol = 'mixed';
else if (https) protocol = 'https';
else if (http) protocol = 'http';
else protocol = 'none';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: The dark side of the inlining is not strong in this one.

const protocol: ElasticsearchClientProtocol = http ? https ? 'mixed' : 'http' ? https ? 'https' : 'none';

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮

@kibana-ci
Copy link
Collaborator

kibana-ci commented Oct 4, 2022

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #36 / analytics instrumented events from the browser Loaded Dashboard full loaded dashboard should emit the "Loaded Dashboard" event when done loading complex dashboard
  • [job] [logs] FTR Configs #36 / analytics instrumented events from the browser Loaded Dashboard full loaded dashboard should emit the "Loaded Dashboard" event when done loading complex dashboard

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/core-elasticsearch-client-server-internal 11 13 +2
@kbn/core-elasticsearch-client-server-mocks 32 34 +2
@kbn/core-metrics-collectors-server-internal 20 25 +5
@kbn/core-metrics-server-internal 5 6 +1
@kbn/core-metrics-server-mocks 7 19 +12
total +22

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
@kbn/core-elasticsearch-client-server-internal 1 2 +1
Unknown metric groups

API count

id before after diff
@kbn/core-elasticsearch-client-server-internal 13 15 +2
@kbn/core-elasticsearch-client-server-mocks 36 38 +2
@kbn/core-metrics-collectors-server-internal 24 29 +5
@kbn/core-metrics-server 48 62 +14
@kbn/core-metrics-server-internal 5 6 +1
@kbn/core-metrics-server-mocks 7 19 +12
core 2686 2687 +1
total +37

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gsoldevila gsoldevila merged commit 25b79a9 into elastic:main Oct 4, 2022
WafaaNasr pushed a commit to WafaaNasr/kibana that referenced this pull request Oct 11, 2022
…c#141434)

* Collect metrics about the connections from esClient to ES nodes

* Misc enhancements following PR remarks and comments

* Fix UTs

* Fix mock typings

* Minimize API surface, fix mocks typings

* Fix incomplete mocks

* Fix renameed agentManager => agentStore in remaining UT

* Cover edge cases for getAgentsSocketsStats()

* Misc NIT enhancements

* Revert incorrect import type statements
WafaaNasr pushed a commit to WafaaNasr/kibana that referenced this pull request Oct 14, 2022
…c#141434)

* Collect metrics about the connections from esClient to ES nodes

* Misc enhancements following PR remarks and comments

* Fix UTs

* Fix mock typings

* Minimize API surface, fix mocks typings

* Fix incomplete mocks

* Fix renameed agentManager => agentStore in remaining UT

* Cover edge cases for getAgentsSocketsStats()

* Misc NIT enhancements

* Revert incorrect import type statements
gsoldevila added a commit that referenced this pull request Nov 14, 2022
[PR #141434](#141434) exposes a
bunch of metrics related to the Elasticsearch Client in the `/api/stats`
endpoint.
While all these stats are interesting, some of them might be less
relevant than others right now.

Let's start by exposing only those stats that are more critical from a
monitoring standpoint.

<img width="440" alt="image"
src="https://user-images.githubusercontent.com/25349407/201688243-4e33cd88-5fa2-48b7-b8ca-2fd175271adc.png">
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Stack Monitoring release_note:feature Makes this part of the condensed release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc telemetry Issues related to the addition of telemetry to a feature v8.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants